Data handling and curation - from raw to clean data

Aud Halbritter (adapted from Matt Grainger)
#| label: setup
#| echo: false
#| eval: true
#| message: false

library(tidyverse) 
library(palmerpenguins) 
library(janitor) 
library(knitr)

Data cycle

Data handling and curation are time-consuming

Data cleaning vs Data wrangling

Data cleaning is the process of removing incorrect, duplicated, mistyped, or otherwise erroneous data from a dataset

Data wrangling is the process of changing the format of your data to make it more useful for your analysis
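A minimal sketch of the distinction, using hypothetical example data (the column names here are invented for illustration): cleaning fixes erroneous values, wrangling reshapes the data.

```r
library(tidyverse)

# Hypothetical example data with an inconsistent site label
df <- tibble(site = c("A", "a", "B"),
             count_2020 = c(3, 5, 2),
             count_2021 = c(4, 1, 6))

# Cleaning: fix the inconsistent capitalisation
df_clean <- df |>
  mutate(site = toupper(site))

# Wrangling: reshape from wide to long for analysis
df_long <- df_clean |>
  pivot_longer(starts_with("count_"),
               names_to = "year",
               names_prefix = "count_",
               values_to = "count")
```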

Data cleaning with the Palmer penguins dataset

Exercise

Run the following commands.

#install.packages("palmerpenguins")
library(palmerpenguins)
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Clean names with janitor

penguins |> 
  names() 
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             
penguins_clean <- penguins |> 
  janitor::clean_names(case = "snake") 

penguins_clean |> 
  names()
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Get a glimpse with dplyr

penguins_clean |>
  dplyr::glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Skim your data with skimr

out <- penguins_clean |> 
  skimr::skim() 

out
Data summary
Name penguins_clean
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

Data cleaning

  • Do not touch raw data files
  • Clean data with code
  • Check each cleaning step

Keep track of changes with tidylog
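tidylog wraps the dplyr and tidyr verbs and prints a message describing what each step changed. A minimal sketch, assuming tidylog is installed and `penguins_clean` was created as above:

```r
library(palmerpenguins)
library(janitor)
library(tidylog)  # masks dplyr/tidyr verbs with logging versions

penguins_clean <- clean_names(penguins)

penguins_filtered <- penguins_clean |>
  filter(!is.na(sex)) |>                     # tidylog reports how many rows were removed
  mutate(body_mass_kg = body_mass_g / 1000)  # and which variables were created
```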

Remove duplicates with dplyr
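The `duplicate_data` tibble printed below is not defined on the slide; it could be constructed like this (the `tribble()` call is an assumption, reproducing the printed values):

```r
library(tibble)

duplicate_data <- tribble(
  ~Day,        ~Person,
  "Monday",    "Becks",
  "Tuesday",   "Amy",
  "Wednesday", "Matt",
  "Wednesday", "Matt"   # duplicated row
)
```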

# A tibble: 4 × 2
  Day       Person
  <chr>     <chr> 
1 Monday    Becks 
2 Tuesday   Amy   
3 Wednesday Matt  
4 Wednesday Matt  
duplicate_data |> 
  dplyr::distinct()
# A tibble: 3 × 2
  Day       Person
  <chr>     <chr> 
1 Monday    Becks 
2 Tuesday   Amy   
3 Wednesday Matt  

Missing data

Missing data is often a problem, e.g. for running a model. Typically as ecologists we sweep missing data under the carpet by using a “complete case” approach to data analysis.

If you have ever written some code like this:

newdf <- na.omit(penguins_clean)

newdf <- penguins_clean[complete.cases(penguins_clean), ]

newdf <- penguins_clean |>
  tidyr::drop_na()

you are removing missing data (NAs) from your dataset.

Why is this a problem?

By throwing away potentially useful data (only including those rows without a NA in them) you reduce the information you are working with, reduce statistical power and introduce selection bias (invalidating any assumption of randomisation).
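Before committing to a complete-case analysis, it is worth quantifying what it would cost you. A quick check on the penguins data:

```r
library(palmerpenguins)
library(tidyr)

n_all      <- nrow(penguins)           # all rows
n_complete <- nrow(drop_na(penguins))  # rows with no NA in any column
n_all - n_complete                     # rows a complete-case analysis would discard
```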

Visualise missing data with naniar

penguins_clean |> 
  naniar::vis_miss()

Different types of missingness

Data can be missing at random or systematically. Data that is missing at random is less problematic.

library(naniar)
penguins_clean |> 
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_miss_point()
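Beyond the plots, naniar can also tabulate missingness per variable, which is useful for deciding how to handle each column:

```r
library(naniar)
library(palmerpenguins)

# Number and percentage of missing values per column, sorted by missingness
miss_summary <- penguins |>
  miss_var_summary()
```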

Data validation

#| label: validation
#| echo: true
#| eval: true
#| message: false

library(validate)

rules <- validator(bill_length_mm >= 0, 
                   body_mass_g > 3000,
                   is.character(species),
                   sex %in% c("female", "male"))

Data validation

#| label: validation-summary
#| echo: true
#| eval: true
#| message: false

out   <- confront(penguins_clean, rules)
summary(out)
  name items passes fails nNA error warning                    expression
1   V1   344    342     0   2 FALSE   FALSE  bill_length_mm - 0 >= -1e-08
2   V2   344    331    11   2 FALSE   FALSE            body_mass_g > 3000
3   V3     1      0     1   0 FALSE   FALSE         is.character(species)
4   V4   344    333     0  11 FALSE   FALSE sex %vin% c("female", "male")

Data validation

#| label: validation-plot
#| echo: true
#| eval: true
#| message: false

plot(out)

Data wrangling with dplyr
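A sketch of a typical dplyr wrangling pipeline on the penguins data: filter out incomplete rows, keep the columns of interest, then summarise by group.

```r
library(dplyr)
library(palmerpenguins)

mass_by_species <- penguins |>
  filter(!is.na(body_mass_g)) |>
  select(species, island, body_mass_g) |>
  group_by(species) |>
  summarise(mean_mass_g = mean(body_mass_g),
            n = n())
```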

Data wrangling with tidyr
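A sketch of a typical tidyr step: pivoting the three millimetre measurements from wide to long format, which is often what ggplot2 and modelling functions expect.

```r
library(tidyr)
library(palmerpenguins)

penguins_long <- penguins |>
  drop_na(bill_length_mm) |>
  pivot_longer(cols = ends_with("_mm"),
               names_to = "measurement",
               values_to = "value_mm")
```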

Code style

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”

my_function <- function(my_data){
  
  my_data |> 
    group_by(group) |> 
    summarise(mean = mean(variable),
              se = sd(variable)/sqrt(n()))
  
} 

my_function<-function(my_data){
  my_data|> group_by(group)|> 
    summarise(mean=mean(variable),se=sd(variable)/sqrt(n()))} 

Useful package for style

library(styler)
 
style_text("my_function<-function(my_data){
  my_data|> group_by(group)|> 
    summarise(mean=mean(variable),se=sd(variable)/sqrt(n()))}")  

Exercise

Use the raw penguin data and do the following tasks.

  • clean the names